Abstract

Quantifying cooperation among random variables in predicting a single 
target random variable is an important problem in many biological systems 
with 10s to 1000s of co-dependent variables. We review the prior literature 
of information theoretical measures of synergy and introduce a novel synergy 
measure, entitled synergistic mutual information, defined as the difference 
between the whole and the union of its parts. We apply all four measures 
against a suite of binary circuits to demonstrate that our measure alone 
quantifies the intuitive concept of synergy across all examples. 



1 Introduction 

Synergy is a fundamental concept in complex systems that has received much attention
in computational biology [1,2]. Several papers [3-6] have proposed measures for quantifying
synergy, but there remains no consensus on which measure is most valid.

The concept of synergy spans many fields and theoretically could be applied to any non- 
subadditive function. But within the confines of Shannon information theory, synergy — 
or more formally, synergistic information — is a property of a set of n random variables 
$\mathbf{X} = \{X_1, X_2, \ldots, X_n\}$ cooperating to predict, that is, reduce the uncertainty of, a single
target random variable Y. 

One clear application of synergistic information is in computational genetics. It is well 
understood that most phenotypic traits are influenced not only by single genes but by 
interactions among genes — for example, human eye-color is cooperatively specified by more 
than a dozen genes [7]. The magnitude of this "cooperative specification" is the synergistic
information between the set of genes X and a phenotypic trait Y, here eye color. Another 
application is neuronal firings where potentially thousands of presynaptic neurons influence 
the firing rate of a single post-synaptic (target) neuron. Yet another application is discovering 
the "informationally synergistic modules" within a multi-scale complex system. 

For pedagogical purposes all examples in the main text are deterministic; however, these
methods apply equally to non-deterministic systems.

The prior literature [8, 9] has termed several distinct concepts as "synergy". This paper 
defines synergy as how much the whole is greater than (the union of) its atomic elements
(eq. (15)). The other notions of synergy are treated in our companion paper, "Quantifying 
the irreducibility of mutual information". 

1.1 Notation 

Figure 1: PI-diagrams for two and three predictors: (a) n = 2, (b) n = 3. Each PI-region
represents nonnegative information about Y. A PI-region's color represents whether its
information is redundant (yellow), unique (magenta), or synergistic (cyan). To preserve
symmetry, the PI-region "{12,13,23}" is displayed as three separate regions each marked
with a "*". All three *-regions should be treated as though they are a single region.

We use the following notation throughout. Let



n: The number of predictors $X_1, X_2, \ldots, X_n$, with $n \geq 2$. In genetics, $X_1, \ldots, X_n$ represent
n distinct genes.

$X_{1\ldots n}$: The joint random variable (coalition) of all n predictors $X_1 X_2 \cdots X_n$.
$X_i$: The i'th predictor random variable (r.v.), $1 \leq i \leq n$.
$\mathbf{X}$: The set of all n predictors $\{X_1, X_2, \ldots, X_n\}$.

Y: The target r.v. to be predicted. In genetics, Y represents a phenotypic trait (e.g. 
eye-color). 

y: A particular state of the target r.v. Y. In genetics, y is a particular state of the 
phenotype (e.g. eye-color = blue). 

In this paper all random variables are discrete, all logarithms are $\log_2$, and all calculations
are in bits. Entropy and mutual information are as defined by Shannon [10],

$H(X) = \sum_{x \in X} \Pr(x) \log \frac{1}{\Pr(x)}$, and $I(X:Y) = \sum_{x,y} \Pr(x,y) \log \frac{\Pr(x,y)}{\Pr(x)\Pr(y)}$.
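The two definitions above can be sketched in a few lines of Python. This is only an illustrative sketch: the dictionary representation of a joint distribution and the helper names `marginal`, `entropy`, and `mutual_information` are assumptions made here, not part of the paper.

```python
from collections import defaultdict
from math import log2

def marginal(joint, idx):
    """Marginalize a joint distribution {outcome_tuple: prob} onto positions idx."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return dict(out)

def entropy(joint, idx):
    """Shannon entropy, in bits, of the variables at positions idx."""
    return -sum(p * log2(p) for p in marginal(joint, idx).values() if p > 0)

def mutual_information(joint, a_idx, b_idx):
    """I(A:B) in bits, where A and B are given as tuples of positions."""
    pa, pb = marginal(joint, a_idx), marginal(joint, b_idx)
    pab = marginal(joint, a_idx + b_idx)
    return sum(
        p * log2(p / (pa[k[:len(a_idx)]] * pb[k[len(a_idx):]]))
        for k, p in pab.items() if p > 0
    )

# Hypothetical joint over (x, y): a fair bit x copied into y.
copy_joint = {(0, 0): 0.5, (1, 1): 0.5}
h_x = entropy(copy_joint, (0,))
i_xy = mutual_information(copy_joint, (0,), (1,))
print(h_x, i_xy)
```

For the copied fair bit both quantities come out to one bit, since knowing X removes all uncertainty about Y.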
1.2 Understanding Pi-diagrams 

Partial information diagrams (Pi-diagrams), introduced by Williams and Beer [6], extend 
Venn diagrams to properly represent synergy. Their framework has been invaluable to the 
evolution of our thinking on synergy. 

A Pi-diagram is composed of nonnegative partial information regions (Pi-regions). Unlike 
the standard Venn entropy diagram in which the sum of all regions is the joint entropy
$H(X_{1\ldots n}, Y)$, in PI-diagrams the sum of all regions (i.e. the space of the PI-diagram) is the
mutual information $I(X_{1\ldots n}:Y)$. PI-diagrams are immensely helpful in understanding how
the mutual information $I(X_{1\ldots n}:Y)$ is distributed across the coalitions and singletons of $\mathbf{X}$.¹

How to read PI-diagrams. Each PI-region is uniquely identified by its "set notation"
where each element is denoted solely by the predictors' indices. For example, in the
PI-diagram for n = 2 (Figure 1a): {1} is the information about Y that only $X_1$ carries
(likewise {2} is the information only $X_2$ carries); {1,2} is the information about Y that
$X_1$ as well as $X_2$ carries; and {12} is the information about Y that is specified only by the
coalition (joint random variable) $X_1 X_2$. The entire disk corresponds to $I(X_1 X_2 : Y)$.

The general structure of a PI-diagram becomes clearer after examining the PI-diagram for
n = 3 (Figure 1b). All PI-regions from n = 2 are again present. Each predictor ($X_1$, $X_2$, $X_3$)
can carry unique information (regions labeled {1}, {2}, {3}), carry information redundantly 
with another predictor ({1,2}, {1,3}, {2,3}), or specify information through a coalition with 
another predictor ({12}, {13}, {23}). New in n = 3 is information carried by all three 
predictors ({1,2,3}) as well as information specified through a three-way coalition ({123}). 
Intriguingly, for three predictors, information can be provided by a coalition as well as 
a singleton ({1,23}, {2,13}, {3,12}) or specified by multiple coalitions ({12,13}, {12,23}, 
{13,23}, {12,13,23}). 
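The region labels listed above are exactly the nonempty antichains of nonempty subsets of the predictor indices (see footnote 1). A minimal brute-force enumeration, assuming small n, is sketched below; the function names are illustrative only.

```python
from itertools import combinations

def nonempty_subsets(n):
    """All nonempty coalitions of the predictor indices 1..n."""
    items = range(1, n + 1)
    return [frozenset(c) for r in range(1, n + 1) for c in combinations(items, r)]

def pi_regions(n):
    """All nonempty antichains of coalitions: no coalition contains another."""
    subsets = nonempty_subsets(n)
    regions = []
    for r in range(1, len(subsets) + 1):
        for family in combinations(subsets, r):
            # '<' on frozensets tests for a proper subset.
            if all(not (a < b or b < a) for a, b in combinations(family, 2)):
                regions.append(family)
    return regions

def label(region):
    """Render a region in the paper's set notation, e.g. '{1,23}'."""
    return "{" + ",".join("".join(str(i) for i in sorted(s)) for s in region) + "}"

for n in (2, 3):
    print(n, len(pi_regions(n)), [label(r) for r in pi_regions(n)])
```

The enumeration is exponential in the number of coalitions, so it is only practical for the small n used in the figures.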

2 Information can be redundant, unique, or synergistic 

Each PI-region represents an irreducible nonnegative slice of the mutual information
$I(X_{1\ldots n}:Y)$ that is either:

1. Redundant. Information carried by a singleton predictor as well as available 
somewhere else. For n = 2: {1,2}. For n = 3: {1,2}, {1,3}, {2,3}, {1,2,3}, {1,23}, 
{2,13}, {3,12}. 

2. Unique. Information carried by exactly one singleton predictor and available nowhere
else. For n = 2: {1}, {2}. For n = 3: {1}, {2}, {3}.

3. Synergistic. Any and all information in $I(X_{1\ldots n}:Y)$ that is not carried by a
singleton predictor. For n = 2: {12}. For n = 3: {12}, {13}, {23}, {123}, {12,13},
{12,23}, {13,23}, {12,13,23}. 
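The three-way classification above depends only on a region's label. The sketch below applies it to the eighteen n = 3 labels from the text; the `parse` helper assumes single-digit predictor indices, and all names are illustrative.

```python
from collections import Counter

def parse(label):
    """'{1,23}' -> ((1,), (2, 3)); assumes single-digit predictor indices."""
    return tuple(tuple(int(ch) for ch in part) for part in label.strip("{}").split(","))

def classify(region):
    """Classify a PI-region given as a tuple of coalitions."""
    has_singleton = any(len(coalition) == 1 for coalition in region)
    if not has_singleton:
        return "synergistic"          # carried by no singleton predictor
    # A lone singleton is unique; a singleton plus anything else is redundant.
    return "unique" if len(region) == 1 else "redundant"

N3_LABELS = [
    "{1}", "{2}", "{3}",
    "{1,2}", "{1,3}", "{2,3}", "{1,2,3}", "{1,23}", "{2,13}", "{3,12}",
    "{12}", "{13}", "{23}", "{123}", "{12,13}", "{12,23}", "{13,23}", "{12,13,23}",
]
counts = Counter(classify(parse(lbl)) for lbl in N3_LABELS)
print(counts)
```

The counts should match the lists in the text: seven redundant, three unique, and eight synergistic regions for n = 3.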

Although a single Pi-region is either redundant, unique, or synergistic, a single state of 
the target can have any combination of nonzero Pi-regions. Therefore a single state of the 
target can convey redundant, unique, and synergistic information. This surprising fact is 
demonstrated in Figure 8. 

2.1 Example Rdn: Redundant information 

If $X_1$ and $X_2$ carry some identical² information (reduce the same uncertainty) about Y, then
we say the set $\mathbf{X} = \{X_1, X_2\}$ has some redundant information about Y. Figure 2 illustrates
a simple case of redundant information. Y has two equiprobable states: r and R (r/R for
"redundant bit"). Examining $X_1$ or $X_2$ identically specifies one bit of Y, thus we say set
$\mathbf{X} = \{X_1, X_2\}$ has one bit of redundant information about Y.

2.2 Example Unq: Unique information 

$X_i$ has unique information about Y if and only if predictor $X_i$ specifies information
about Y that is not specified anywhere else (a singleton or coalition of the other n − 1
predictors). Figure 3 illustrates a simple case of unique information. Y has four equiprobable
states: ab, aB, Ab, and AB. $X_1$ uniquely specifies bit a/A, and $X_2$ uniquely specifies bit b/B.
If we had instead labeled the Y-states 0, 1, 2, and 3, $X_1$ and $X_2$ would still have strictly
unique information about Y. The state of $X_1$ would specify between {0,1} and {2,3}, and
the state of $X_2$ would specify between {0,2} and {1,3}, together fully specifying the state
of Y.

*To whom correspondence should be addressed. Email: virgil@caltech.edu
1 Formally, how the mutual information is distributed across the set of all nonempty antichains
on the powerset of $\mathbf{X}$ [11, 12].

2 $X_1$ and $X_2$ providing identical information about Y is different from providing the same amount
of information about Y, i.e. $I(X_1:Y) = I(X_2:Y)$. Example Unq (Figure 3) is an example where
$I(X_1:Y) = I(X_2:Y) = 1$ bit yet $X_1$ and $X_2$ specify "different bits" of Y. Providing the same amount
of information about Y is neither necessary nor sufficient for providing some identical information
about Y.



2.3 Example Xor: Synergistic information 

A set of predictors $\mathbf{X} = \{X_1, \ldots, X_n\}$ has synergistic information about Y if and only if the
whole ($X_{1\ldots n}$) specifies information about Y that is not specified by any singleton predictor.
The canonical example of synergistic information is the XOR-gate (Figure 4). In this example,
the whole $X_1 X_2$ fully specifies Y,

$I(X_1 X_2 : Y) = H(Y) = 1$ bit, (1)

but the singletons $X_1$ and $X_2$ specify nothing about Y,

$I(X_1 : Y) = I(X_2 : Y) = 0$ bits. (2)

With both $X_1$ and $X_2$ themselves having zero information about Y, we know that there cannot
be any redundant or unique information about Y: PI-regions {1} = {2} = {1,2} = 0 bits.
As the information between $X_1 X_2$ and Y must come from somewhere, by elimination
we conclude that $X_1$ and $X_2$ synergistically specify Y.
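Eqs. (1) and (2) can be checked numerically. The sketch below assumes uniformly distributed, independent input bits (as in the Xor example) and uses an illustrative dictionary encoding of the joint distribution; the helper names are not from the paper.

```python
from collections import defaultdict
from itertools import product
from math import log2

def marginal(joint, idx):
    """Marginalize a joint distribution {outcome_tuple: prob} onto positions idx."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return dict(out)

def mutual_information(joint, a_idx, b_idx):
    """I(A:B) in bits, where A and B are given as tuples of positions."""
    pa, pb = marginal(joint, a_idx), marginal(joint, b_idx)
    pab = marginal(joint, a_idx + b_idx)
    return sum(
        p * log2(p / (pa[k[:len(a_idx)]] * pb[k[len(a_idx):]]))
        for k, p in pab.items() if p > 0
    )

# Outcomes are (x1, x2, y) with y = x1 XOR x2 and uniform inputs.
xor_joint = {(x1, x2, x1 ^ x2): 0.25 for x1, x2 in product((0, 1), repeat=2)}

whole = mutual_information(xor_joint, (0, 1), (2,))   # I(X1 X2 : Y)
single_1 = mutual_information(xor_joint, (0,), (2,))  # I(X1 : Y)
single_2 = mutual_information(xor_joint, (1,), (2,))  # I(X2 : Y)
print(whole, single_1, single_2)
```

The whole carries one bit while each singleton carries none, which is the pattern the elimination argument relies on.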



3 Three examples elucidating properties of synergy 

To aid the reader in developing intuition for any proper measure of synergy we illustrate 
some desired properties of synergistic information with pedagogical examples. All three 
examples derive from example Xor. Readers solely interested in the contrast with prior 
measures can skip to Section 4.

Figure 2: Example Rdn. Panels: (a) $\Pr(x_1, x_2, y)$, (b) circuit diagram, (c) PI-diagram.
Figure 2a shows the joint distribution of r.v.'s $X_1$, $X_2$, and Y, $\Pr(x_1, x_2, y)$, revealing
that all three terms are fully correlated. Figure 2b represents the joint distribution as an
electrical circuit. Figure 2c is the PI-diagram indicating that set $\{X_1, X_2\}$ has 1 bit of
redundant information about Y. $I(X_1 X_2:Y) = I(X_1:Y) = I(X_2:Y) = H(Y) = 1$ bit.

Figure 3: Example Unq. Panels: (a) $\Pr(x_1, x_2, y)$, (b) circuit diagram, (c) PI-diagram.
$X_1$ and $X_2$ each uniquely specify a single bit of Y. $I(X_1 X_2:Y) = H(Y) = 2$ bits.

Figure 4: Example Xor. Panels: (a) $\Pr(x_1, x_2, y)$, (b) circuit diagram, (c) PI-diagram.
$X_1$ and $X_2$ synergistically specify Y. $I(X_1 X_2:Y) = H(Y) = 1$ bit.

3.1 XorDuplicate: Duplicating a predictor does not change synergistic 
information 

Example XorDuplicate (Figure 5) adds a third predictor, $X_3$, a copy of predictor $X_1$,
to Xor. Whereas in Xor the target Y is specified only by coalition $X_1 X_2$, duplicating
predictor $X_1$ as $X_3$ makes the target equally specifiable by coalition $X_3 X_2$.

Although now two different coalitions identically specify Y, mutual information is invariant
to duplicates, e.g. $I(X_1 X_2 X_3:Y) = I(X_1 X_2:Y) = 1$ bit. For synergistic information to
be likewise bounded between zero and the total mutual information $I(X_{1\ldots n}:Y)$, synergistic
information must similarly be invariant to duplicates, e.g. the synergistic information between
set $\{X_1, X_2\}$ and Y must be the same as the synergistic information between $\{X_1, X_2, X_3\}$
and Y. This makes sense because if synergistic information is defined as the information
in the whole beyond its parts, duplicating a part does not increase the net information
provided by the parts. Altogether, we assert that duplicating a predictor does not change the
synergistic information. Without this property, the synergistic mutual information will not
be bounded between 0 and $I(X_{1\ldots n}:Y)$.
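The invariance of mutual information to a duplicated predictor can be confirmed directly on XorDuplicate. This is a small sketch assuming uniform inputs; the encoding of outcomes as `(x1, x2, x3, y)` tuples and the helper names are illustrative choices.

```python
from collections import defaultdict
from itertools import product
from math import log2

def marginal(joint, idx):
    """Marginalize a joint distribution {outcome_tuple: prob} onto positions idx."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return dict(out)

def mutual_information(joint, a_idx, b_idx):
    """I(A:B) in bits, where A and B are given as tuples of positions."""
    pa, pb = marginal(joint, a_idx), marginal(joint, b_idx)
    pab = marginal(joint, a_idx + b_idx)
    return sum(
        p * log2(p / (pa[k[:len(a_idx)]] * pb[k[len(a_idx):]]))
        for k, p in pab.items() if p > 0
    )

# Outcomes are (x1, x2, x3, y) with x3 a copy of x1 and y = x1 XOR x2.
xor_dup_joint = {(x1, x2, x1, x1 ^ x2): 0.25 for x1, x2 in product((0, 1), repeat=2)}

full = mutual_information(xor_dup_joint, (0, 1, 2), (3,))  # I(X1 X2 X3 : Y)
pair_12 = mutual_information(xor_dup_joint, (0, 1), (3,))  # I(X1 X2 : Y)
pair_32 = mutual_information(xor_dup_joint, (2, 1), (3,))  # I(X3 X2 : Y)
print(full, pair_12, pair_32)
```

All three values agree at one bit, so adding the copy changes which coalitions specify Y but not how much is specified.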

3.2 XorLoses: Adding a new predictor can decrease synergy 

Example XorLoses (Figure 6) adds a third predictor, $X_3$, to Xor and concretizes the
distinction between synergy and "redundant synergy". In XorLoses the target Y has one
bit of uncertainty and just as in example Xor the coalition $X_1 X_2$ fully specifies the target,
$I(X_1 X_2:Y) = H(Y) = 1$ bit. However, XorLoses has zero intuitive synergy because the
newly added singleton predictor, $X_3$, fully specifies Y by itself. This makes the synergy
between $X_1$ and $X_2$ completely redundant: everything the coalition $X_1 X_2$ specifies is
already specified by the singleton $X_3$.



4 Prior measures of synergy 

4.1 $I_{\max}$ synergy: $\mathcal{S}_{\max}(\mathbf{X}:Y)$

$I_{\max}$ synergy, denoted $\mathcal{S}_{\max}$, derives from [6]. $\mathcal{S}_{\max}$ defines synergy as the whole beyond the
(state-dependent) maximum of its parts,

$\mathcal{S}_{\max}(\mathbf{X}:Y) = I(X_{1\ldots n}:Y) - I_{\max}(\{X_1, \ldots, X_n\}:Y)$
$\qquad = I(X_{1\ldots n}:Y) - \sum_{y \in Y} \Pr(Y=y) \max_i I(X_i:Y=y)$,

where $I(X_i:Y=y)$ is [13]'s "specific-surprise",

$I(X_i:Y=y) = D_{KL}\left[\Pr(X_i|y) \,\|\, \Pr(X_i)\right]$
$\qquad = \sum_{x_i \in X_i} \Pr(x_i|y) \log \frac{\Pr(x_i|y)}{\Pr(x_i)}$.

Figure 5: Example XorDuplicate (panel (c): PI-diagram) shows that duplicating predictor
$X_1$ as $X_3$ turns the single-coalition synergy {12} into the multi-coalition synergy {12,23}.
After duplicating $X_1$, the coalition $X_3 X_2$ as well as coalition $X_1 X_2$ specifies Y. Synergistic
information is unchanged from Xor, $I(X_3 X_2:Y) = I(X_1 X_2:Y) = H(Y) = 1$ bit.



There are two major advantages of $\mathcal{S}_{\max}$ synergy. First, $\mathcal{S}_{\max}$ obeys the bounds
$0 \leq \mathcal{S}_{\max}(\mathbf{X}:Y) \leq I(X_{1\ldots n}:Y)$. Second, $\mathcal{S}_{\max}$ is invariant to duplicate predictors.
Despite these desired properties, $\mathcal{S}_{\max}$ miscategorizes merely unique information as
synergistic whenever two or more predictors have unique information about the target. This
can be seen in example Unq (Figure 3). In example Unq the wires in Figure 3b don't
even touch, yet $\mathcal{S}_{\max}$ asserts there is one bit of synergy and one bit of redundancy, which is
palpably strange.

The common defense of $\mathcal{S}_{\max}$ against example Unq is to say one should "break up" Y into
its components a/A and b/B and then compute $\mathcal{S}_{\max}$ for each component. Unfortunately
this does not fully solve the problem because we often do not have the ability to "break up"
Y. For instance, if the Y-states in Unq were instead labeled as 0, 1, 2, and 3, we wouldn't
have the ability to break Y into its components.

A more abstract way to understand why $\mathcal{S}_{\max}$ overestimates synergy is to imagine a
hypothetical example where there are exactly two bits of unique information for every state
$y \in Y$ and no synergy or redundancy. $\mathcal{S}_{\max}$ would be the whole (both unique bits) minus the
maximum over both predictors, which would be max[1, 1] = 1 bit. The synergy
would then be 2 − 1 = 1 bit, even though by definition there was no synergy but
merely two bits of unique information.

Altogether, we conclude that $\mathcal{S}_{\max}$ overestimates the intuitive synergy by miscategorizing
merely unique information as synergistic whenever two or more predictors have unique
information about the target.

Figure 6: Example XorLoses. Panels: (a) $\Pr(x_1, x_2, x_3, y)$, (b) circuit diagram, (c) PI-diagram.
Target Y is fully specified by the coalition $X_1 X_2$ as well as by the singleton $X_3$.
$I(X_1 X_2:Y) = I(X_3:Y) = H(Y) = 1$ bit. Therefore the information synergistically
specified by coalition $X_1 X_2$ is a redundant synergy.
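The overestimate on example Unq discussed in Section 4.1 can be reproduced with a direct implementation of the whole minus the state-dependent maximum of specific surprises. This is a sketch under the assumption of a uniform distribution over the four Y-states; the function names `specific_surprise` and `s_max` are illustrative, not the authors' code.

```python
from collections import defaultdict
from itertools import product
from math import log2

def marginal(joint, idx):
    """Marginalize a joint distribution {outcome_tuple: prob} onto positions idx."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return dict(out)

def mutual_information(joint, a_idx, b_idx):
    """I(A:B) in bits, where A and B are given as tuples of positions."""
    pa, pb = marginal(joint, a_idx), marginal(joint, b_idx)
    pab = marginal(joint, a_idx + b_idx)
    return sum(
        p * log2(p / (pa[k[:len(a_idx)]] * pb[k[len(a_idx):]]))
        for k, p in pab.items() if p > 0
    )

def specific_surprise(joint, i, y_pos, y):
    """D_KL[Pr(X_i | Y=y) || Pr(X_i)] in bits."""
    p_xi = marginal(joint, (i,))
    p_xi_y = marginal(joint, (i, y_pos))
    p_y = marginal(joint, (y_pos,))[(y,)]
    total = 0.0
    for (x, y_val), p in p_xi_y.items():
        if y_val == y and p > 0:
            cond = p / p_y                      # Pr(x_i | y)
            total += cond * log2(cond / p_xi[(x,)])
    return total

def s_max(joint, n):
    """Whole minus the expected per-state maximum over the n singleton predictors."""
    whole = mutual_information(joint, tuple(range(n)), (n,))
    i_max = sum(
        p_y * max(specific_surprise(joint, i, n, y) for i in range(n))
        for (y,), p_y in marginal(joint, (n,)).items()
    )
    return whole - i_max

# Example Unq: outcomes are (x1, x2, y) with y the pair (x1, x2).
unq_joint = {(a, b, (a, b)): 0.25 for a, b in product("aA", "bB")}
print(s_max(unq_joint, 2))
```

Each predictor contributes one bit of specific surprise in every state, so the maximum removes only one of the two unique bits and the remaining bit is reported as synergy.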



4.2 WholeMinusSum synergy: WMS (X : Y) 

The earliest known sightings of the bivariate case of WholeMinusSum synergy (WMS) are
in [14, 15] and of the general case in [16]. WholeMinusSum synergy is a signed measure where a
positive value signifies synergy and a negative value signifies redundancy. WholeMinusSum
synergy is defined by eq. (7) and interestingly reduces to eq. (10), the difference of two total
correlations (i.e. $\mathrm{TC}(X_1; \cdots; X_n) = -H(X_{1\ldots n}) + \sum_{i=1}^{n} H(X_i)$) [17].

$\mathrm{WMS}(\mathbf{X}:Y) = I(X_{1\ldots n}:Y) - \sum_{i=1}^{n} I(X_i:Y)$ (7)
$\qquad = H(X_{1\ldots n}) - H(X_{1\ldots n}|Y) - \sum_{i=1}^{n} H(X_i) + \sum_{i=1}^{n} H(X_i|Y)$ (8)
$\qquad = \mathrm{TC}(X_1; \cdots; X_n|Y) - D_{KL}\left[\Pr(X_{1\ldots n}) \,\Big\|\, \prod_{i=1}^{n} \Pr(X_i)\right]$ (9)
$\qquad = \mathrm{TC}(X_1; \cdots; X_n|Y) - \mathrm{TC}(X_1; \cdots; X_n)$ (10)

Writing eq. (7) for n = 2 as a PI-diagram (Figure 7a) reveals that for n = 2 WMS is
the synergy between $X_1$ and $X_2$ minus their redundancy. Thus, if there were an equal
magnitude of synergy and redundancy between $X_1$ and $X_2$ (as in RdnXor, Figure 8),
WholeMinusSum synergy would be zero, leading one to erroneously conclude there is no
synergy or redundancy present.³ WholeMinusSum's PI-diagram for n = 3 (Figure 7b) reveals
that for n > 2, $\mathrm{WMS}(\mathbf{X}:Y)$ becomes synergy minus the redundancy counted multiple times
(the example ParityRdnRdn in Appendix A demonstrates this).

Thus WholeMinusSum underestimates the intuitive synergy for all n, with the potential gap
increasing with n. Equivalently, we say that WholeMinusSum synergy is a lower bound on
the intuitive synergy, with the bound becoming looser with larger n. For example, for n = 2
(Figure 7a) WholeMinusSum double-subtracts PI-region {1,2}, but for n = 3 (Figure 7b)
WholeMinusSum double-subtracts PI-regions {1,2}, {1,3}, {2,3} and triple-subtracts
PI-region {1,2,3}. Because the whole contributes each region once, the net weights are −1
for the double-subtracted regions and −2 for {1,2,3}.



Figure 7: PI-diagrams representing WholeMinusSum synergy for n = 2 (left) and n = 3
(right). For this diagram the colors merely denote the added and subtracted PI-regions.
$\mathrm{WMS}(\mathbf{X}:Y)$ is the green PI-region(s), minus the orange PI-region(s), minus two times any
red PI-region.

A concrete example demonstrating WholeMinusSum's "synergy minus redundancy" behavior
is example RdnXor (Figure 8) which overlays examples Rdn and Xor to form a single
system. The target Y has two bits of uncertainty or entropy, i.e. $H(Y) = 2$ bits. Like Rdn,
either $X_1$ or $X_2$ identically specifies the letter of Y (r/R), making one bit of redundant
information. Like Xor, only the coalition $X_1 X_2$ specifies the digit of Y (0/1), making one
bit of synergistic information. Together this makes one bit of redundancy and one bit of
synergy.
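The cancellation described for RdnXor follows directly from eq. (7). The sketch below assumes each predictor carries the shared letter plus one private input bit, all uniform; this encoding and the function name `wms` are illustrative assumptions.

```python
from collections import defaultdict
from itertools import product
from math import log2

def marginal(joint, idx):
    """Marginalize a joint distribution {outcome_tuple: prob} onto positions idx."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return dict(out)

def mutual_information(joint, a_idx, b_idx):
    """I(A:B) in bits, where A and B are given as tuples of positions."""
    pa, pb = marginal(joint, a_idx), marginal(joint, b_idx)
    pab = marginal(joint, a_idx + b_idx)
    return sum(
        p * log2(p / (pa[k[:len(a_idx)]] * pb[k[len(a_idx):]]))
        for k, p in pab.items() if p > 0
    )

def wms(joint, n):
    """WholeMinusSum synergy, eq. (7): whole minus the sum of singleton terms."""
    whole = mutual_information(joint, tuple(range(n)), (n,))
    return whole - sum(mutual_information(joint, (i,), (n,)) for i in range(n))

# RdnXor: x1 = (r, a), x2 = (r, b), y = (r, a XOR b); r is the redundant bit.
rdnxor_joint = {
    ((r, a), (r, b), (r, a ^ b)): 1 / 8
    for r, a, b in product((0, 1), repeat=3)
}
xor_joint = {(x1, x2, x1 ^ x2): 0.25 for x1, x2 in product((0, 1), repeat=2)}
print(wms(rdnxor_joint, 2), wms(xor_joint, 2))
```

For RdnXor the whole is two bits and each singleton is one bit, so the one bit of synergy is exactly offset by the one bit of redundancy.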

Note that in RdnXor every state $y \in Y$ conveys one bit of redundant information and one
bit of synergistic information, e.g. for the state y = r0 the letter "r" is specified redundantly
and the digit "0" is specified synergistically. Example RdnUnqXor (Appendix A) extends
RdnXor to demonstrate redundant, unique, and synergistic information for every state
$y \in Y$.

3 This is different from [3]'s point that a mish-mash of synergy and redundancy across different
states of $y \in Y$ can average to zero. Figure 8 evaluates to zero for every state $y \in Y$.

Figure 8: Example RdnXor has one bit of redundancy and one bit of synergy. Panels:
(a) $\Pr(x_1, x_2, y)$, (b) circuit diagram, (c) PI-diagram. For this example,
$\mathrm{WMS}(\mathbf{X}:Y) = 0$ bits.



4.3 Correlational importance: $\Delta I(\mathbf{X};Y)$

Correlational importance, denoted $\Delta I$, comes from [5,18-21]. Correlational importance
quantifies the "informational importance of conditional dependence" or the "information
lost when ignoring conditional dependence" among the predictors decoding target Y. As
conditional dependence is necessary for synergy, $\Delta I$ seems related to our intuitive conception
of synergy. $\Delta I$ is defined as,



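$\Delta I(\mathbf{X};Y) = D_{KL}\left[\Pr(Y|X_{1\ldots n}) \,\|\, \Pr_{\mathrm{ind}}(Y|\mathbf{X})\right]$ (11)
$\qquad = \sum_{y,\, x_{1\ldots n}} \Pr(y, x_{1\ldots n}) \log \frac{\Pr(y|x_{1\ldots n})}{\Pr_{\mathrm{ind}}(y|\mathbf{x})}$, (12)

where $\Pr_{\mathrm{ind}}(y|\mathbf{x}) \equiv \frac{\Pr(y) \prod_{i=1}^{n} \Pr(x_i|y)}{\sum_{y'} \Pr(y') \prod_{i=1}^{n} \Pr(x_i|y')}$. After some algebra⁴ eq. (12) becomes,

$\Delta I(\mathbf{X};Y) = \mathrm{TC}(X_1; \cdots; X_n|Y) - D_{KL}\left[\Pr(X_{1\ldots n}) \,\Big\|\, \sum_{y} \Pr(y) \prod_{i=1}^{n} \Pr(X_i|y)\right]$, (13)

which strikingly resembles WholeMinusSum eq. (9) reproduced below,

$\mathrm{WMS}(\mathbf{X}:Y) = \mathrm{TC}(X_1; \cdots; X_n|Y) - D_{KL}\left[\Pr(X_{1\ldots n}) \,\Big\|\, \prod_{i=1}^{n} \Pr(X_i)\right]$.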

Eqs. (9) and (13) have the same upper bound of $\mathrm{TC}(X_1; \cdots; X_n|Y)$ and furthermore are
algebraically identical up to the right-hand side of the KL-divergence. Such uncanny
similarities have led some to think that $\Delta I$ quantifies some kind of synergistic information; indeed,
there has been heated debate [3,21] contrasting WMS and $\Delta I$.

$\Delta I$ is conceptually innovative and moreover agrees with our intuition for almost all of our
examples. Yet further examples reveal that $\Delta I$ measures something ever-so-subtly different
from intuitive synergistic information.



4 See Appendix B for the algebraic steps between eqs. (12) and (13).

The first example is [3]'s Figure 4 where $\Delta I$ exceeds⁵ the mutual information $I(X_{1\ldots n}:Y)$
with $\Delta I(\mathbf{X};Y) = 0.0145$ and $I(X_{1\ldots n}:Y) = 0.0140$. This fact alone prevents interpreting $\Delta I$
as a loss of mutual information from $I(X_{1\ldots n}:Y)$. Although $\Delta I$ cannot be a loss of mutual
information, it could still be a loss of some alternative information (like Wyner's common
information [22,23]).

Instead, could $\Delta I$ upper-bound synergy? We turn to example And (Figure 9). Example
And has n = 2 independent predictors and target Y is the AND of $X_1$ and $X_2$. Although
And's PI-region decomposition is subtler than Xor's, we can still intuit its decomposition by
a fortunate special case.

For $X_1$ and $X_2$ to redundantly specify Y, $X_1$ and $X_2$ themselves must have some information
about each other.⁶ However, because $X_1$ and $X_2$ are independent, $I(X_1:X_2) = 0$ bits,
there must be zero redundant information, meaning PI-region {1,2} = 0 bits.⁷ With zero
redundancy, the unique information PI-regions are simply the mutual information between
the singletons and the target, {1} = $I(X_1:Y)$ = 0.311 bits and {2} = $I(X_2:Y)$ = 0.311
bits; these are computed using the uniform distribution per Figure 9a. From there, the
synergy (PI-region {12}) is simply the whole, $I(X_1 X_2:Y)$, minus the unique PI-regions ({1}
and {2}) and redundant PI-region ({1,2}) for 0.811 − 0.311 − 0.311 = 0.189 bits of synergy.
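The arithmetic for example And can be reproduced as follows. The sketch assumes uniform, independent input bits and takes the zero-redundancy argument above as given; the variable names are illustrative.

```python
from collections import defaultdict
from itertools import product
from math import log2

def marginal(joint, idx):
    """Marginalize a joint distribution {outcome_tuple: prob} onto positions idx."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return dict(out)

def mutual_information(joint, a_idx, b_idx):
    """I(A:B) in bits, where A and B are given as tuples of positions."""
    pa, pb = marginal(joint, a_idx), marginal(joint, b_idx)
    pab = marginal(joint, a_idx + b_idx)
    return sum(
        p * log2(p / (pa[k[:len(a_idx)]] * pb[k[len(a_idx):]]))
        for k, p in pab.items() if p > 0
    )

# Outcomes are (x1, x2, y) with y = x1 AND x2 and uniform inputs.
and_joint = {(x1, x2, x1 & x2): 0.25 for x1, x2 in product((0, 1), repeat=2)}

whole = mutual_information(and_joint, (0, 1), (2,))         # I(X1 X2 : Y)
unique_1 = mutual_information(and_joint, (0,), (2,))        # PI-region {1}
unique_2 = mutual_information(and_joint, (1,), (2,))        # PI-region {2}
predictor_overlap = mutual_information(and_joint, (0,), (1,))  # I(X1 : X2)
redundancy = 0.0  # taken from the independence argument in the text
synergy = whole - unique_1 - unique_2 - redundancy          # PI-region {12}
print(round(whole, 3), round(unique_1, 3), round(synergy, 3))
```

Rounded to three decimals the values should agree with the 0.811, 0.311, and 0.189 bits quoted above.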




Figure 9: Example And. Panels: (b) circuit diagram, (c) PI-diagram. $X_1$ and $X_2$ each have
0.311 bits of unique information. Additionally, $X_1$ and $X_2$ synergistically specify 0.189 bits,
and redundantly specify zero bits. $I(X_1 X_2:Y) = H(Y) = 0.811$ bits.

In example And the WMS synergy, the lower bound on the intuitive synergy, is 0.189
bits, yet $\Delta I(\mathbf{X};Y) = 0.104$ bits, and we conclude that $\Delta I$ does not upper-bound synergy.

Finally, in the face of duplicate predictors $\Delta I$ often decreases. From example And to
AndDuplicate (Section 4.4, Figure 10) $\Delta I$ drops 63% to 0.038 bits.

Taking all three examples together, we conclude $\Delta I$ measures something fundamentally
different from synergistic information.
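The two $\Delta I$ values quoted above can be recomputed from eq. (12). The following is a sketch, assuming uniform inputs and the conditionally independent decoder $\Pr_{\mathrm{ind}}$ built from the single-predictor conditionals; `delta_i` is an illustrative name, not the authors' implementation.

```python
from collections import defaultdict
from itertools import product
from math import log2, prod

def marginal(joint, idx):
    """Marginalize a joint distribution {outcome_tuple: prob} onto positions idx."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return dict(out)

def delta_i(joint, n):
    """Correlational importance, eq. (12); outcomes are (x_1, ..., x_n, y)."""
    p_y = marginal(joint, (n,))
    p_x = marginal(joint, tuple(range(n)))
    p_xi_y = [marginal(joint, (i, n)) for i in range(n)]

    def ind_weight(x, y):
        # Unnormalized independent posterior: Pr(y) * prod_i Pr(x_i | y).
        return p_y[(y,)] * prod(
            p_xi_y[i].get((x[i], y), 0.0) / p_y[(y,)] for i in range(n)
        )

    total = 0.0
    for outcome, p in joint.items():
        if p <= 0:
            continue
        x, y = outcome[:n], outcome[n]
        true_posterior = p / p_x[x]
        ind_posterior = ind_weight(x, y) / sum(ind_weight(x, yy) for (yy,) in p_y)
        total += p * log2(true_posterior / ind_posterior)
    return total

and_joint = {(x1, x2, x1 & x2): 0.25 for x1, x2 in product((0, 1), repeat=2)}
# AndDuplicate: x3 is a copy of x1.
and_dup_joint = {(x1, x2, x1, x1 & x2): 0.25 for x1, x2 in product((0, 1), repeat=2)}
print(round(delta_i(and_joint, 2), 3), round(delta_i(and_dup_joint, 3), 3))
```

In both cases the only contribution comes from the all-ones input, where the independent decoder is less certain than the true posterior; the duplicate makes the independent decoder more confident, which is why the value shrinks.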

4.4 AndDuplicate: One example to rule them all 

Our final example, AndDuplicate (Figure 10), reveals undesirable behavior in all three 
prior measures. Example AndDuplicate adds a duplicate predictor to example And to 
show how each synergy measure responds to a duplicate predictor in a less pristine example 
than Xor. Before, in XorDuplicate, we saw that when duplicating predictor $X_1$, the
synergistic information was unchanged. But unlike Xor, in example And both $X_1$ and $X_2$
have unique information. What happens to those two unique informations when duplicating
a predictor? Most importantly, would either reduce synergy in the spirit of XorLoses?
Taking each one at a time:

• Predictor $X_2$ is unaltered from example And. Thus $X_2$'s unique information stays
the same. And's {2} → AndDuplicate's {2}.

• Predictor $X_3$ is identical to $X_1$. Thus all of $X_1$'s unique information in And
becomes redundant information between predictors $X_1$ and $X_3$. And's {1} →
AndDuplicate's {1,3}. When duplicating a predictor, the predictor's unique
information becomes redundant information.

• In And there is synergy between $X_1$ and $X_2$, and this synergy is still present in
AndDuplicate. Just as in XorDuplicate, the only difference is that now an identical
synergy also exists between $X_3$ and $X_2$. Thus And's {12} → AndDuplicate's
{12,23}.

• Predictor $X_3$ is identical to $X_1$. Therefore any information in And that is specified
by both $X_1$ and $X_2$ would now be specified by $X_1$, $X_2$, and $X_3$. Thus And's {1,2} →
AndDuplicate's {1,2,3}.

5 As $\Delta I(\mathbf{X};Y)$ is often normalized by $I(X_{1\ldots n}:Y)$, it's concerning that $\Delta I(\mathbf{X};Y)$ can exceed
$I(X_{1\ldots n}:Y)$.

6 A way to conceptualize this is that for two predictors to have redundant information about
a target, the two predictors themselves must have some overlapping/redundant entropy; for two
independent predictors this is $H(X_1) + H(X_2) - H(X_1 X_2) = 0$ overlapping bits.

7 Our assumption that positive redundant information between $X_1$ and $X_2$ requires positive
$I(X_1:X_2)$ is disputed by [24]. A forthcoming publication [25] gives credence to our assumption.

Inspecting the finished Pi-diagram in Figure 10c we see that duplicating a predictor leaves 
the intuitive synergistic information unchanged. To what conclusions do the three measures 
of synergy in the literature come? 

WholeMinusSum synergy arrives at 0.189 bits of synergy for And, but 0.189 − 0.311 ≈ −0.123
bits (using unrounded values) for AndDuplicate. This again shows that WholeMinusSum
subtracts redundancy from synergy, and is not invariant to duplicate predictors.

Correlational importance, $\Delta I$, arrives at 0.104 bits of synergy for And, but 0.038 bits for
AndDuplicate. $\Delta I$ is not invariant to duplicate predictors.

$\mathcal{S}_{\max}$ arrives at 0.5 bits of synergy for both And and AndDuplicate. This is expected
because $\mathcal{S}_{\max}$ is provably invariant to duplicate predictors. The only problem with $\mathcal{S}_{\max}$'s
answer for AndDuplicate is that it inherits the same overestimated synergy from And
(discussed in Section 4.3).

5 Synergistic mutual information 

We are all familiar with the English expression describing synergy as the whole being 
greater than the "sum of its parts". Although this informal adage, formalized by WMS 
synergy, captures the intuition behind synergy, we saw that WholeMinusSum "double-counts" 
whenever there is duplication (redundancy) among the parts. A mathematically correct 
adage should change "sum" to "union" — meaning synergy occurs when the whole is greater 
than the union of its parts. Summing adds duplicate information multiple times, whereas 
union adds duplicate information only once. The union of the parts never exceeds the sum. 

This guiding intuition of "whole minus union" leads us to a novel measure entitled synergistic
mutual information, denoted $\mathcal{S}(\{X_1, \ldots, X_n\}:Y)$, or $\mathcal{S}(\mathbf{X}:Y)$, defined as the mutual
information in the whole that is not in the union of its parts.

Unfortunately, there's no measure of "union-information" in contemporary information theory.
We introduce a novel technique, derived from [26,27], for defining the union information 
among n predictors. We numerically compute the union information by passing Y through 
a channel (particularly, a square transition matrix) that preserves only the bits that are 
specified by singleton predictors. This is achieved like so, 

$I_\cup(\{X_1, \ldots, X_n\}:Y) = \min_{\Pr(Y'|Y)} I(X_{1\ldots n}:Y')$ (14)

subject to $X_{1\ldots n} \to Y \to Y'$,
$I(X_i:Y') = I(X_i:Y) \;\; \forall i$.

The constraint $X_{1\ldots n} \to Y \to Y'$ is a Markov chain placing Y between $X_{1\ldots n}$ and $Y'$. This
Markov chain ensures that all information between $X_{1\ldots n}$ and $Y'$ is also between $X_{1\ldots n}$
and Y, thus $I(X_{1\ldots n}:Y') \leq I(X_{1\ldots n}:Y)$. An equivalent way of conceptualizing this Markov
chain is that it forces the joint distribution $\Pr(x_{1\ldots n}, y, y') = \Pr(x_{1\ldots n}, y) \Pr(y'|y)$. Without
another constraint, $\min_{X_{1\ldots n} \to Y \to Y'} I(X_{1\ldots n}:Y')$ is trivially found to be zero bits. This is
because simply setting $Y'$ to a constant would have $H(Y') = 0$ bits, thus necessitating
$I(X_{1\ldots n}:Y') = 0$ bits. Therefore we must also explicitly state which bits of Y we wish to
retain: those bits originating from the singletons $\{X_1, \ldots, X_n\}$.

Figure 10: Example AndDuplicate (panel (c): PI-diagram). The total mutual information is
the same as in And, $I(X_1 X_2:Y) = I(X_1 X_2 X_3:Y) = 0.811$ bits. Every PI-region in example
And (Figure 9c) maps to a PI-region in AndDuplicate. At 0.189 bits, the intuitive
synergistic information is unchanged from And.

Taken together these constraints ensure that: (1) all information between $X_{1\ldots n}$ and $Y'$ is
also between $X_{1\ldots n}$ and Y; (2) $I(X_{1\ldots n}:Y')$ only contains bits that the constraints explicitly
preserve, those in $I(X_i:Y)$ $\forall i$. Finally, we prove that a minimum of eq. (14) always
exists because setting $\Pr(Y'|Y)$ to the identity matrix satisfies all constraints.⁸

8 Because of the specific nature of the minimization, removing the Markov constraint
$X_{1\ldots n} \to Y \to Y'$ doesn't alter the result of eq. (14), but the resulting equation becomes
less intuitive and harder to numerically compute.

Unfortunately we currently have no analytic way to calculate $I_\cup$ (eq. (14)). In practice we
use MATLAB to perform gradient descent optimization using the function fmincon. We
have explored some of the properties of the minimization in eq. (14). First, because of the
constraint $I(X_i:Y') = I(X_i:Y)$ $\forall i$, the minimization is unfortunately not convex; therefore
there's no straightforward way to verify we've found the minimum. However, for n = 2
the minimization procedure can be rewritten as a convex minimization [25,27]. Second,
the minimizing matrix $\Pr(Y'|Y)$ is not unique. We are exploring analytic solutions for $I_\cup$.
Our union-information measure satisfies all desired properties for a union-measure from [24]:
non-negativity, idempotency, symmetry among $X_i$'s, nondecreasing with additional $X_i$'s,
and equal to the entropy $H(X_{1\ldots n})$ when $Y = X_{1\ldots n}$.
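The ingredients of eq. (14) can be made concrete without running an optimizer. The sketch below is not the MATLAB fmincon procedure described above; it only evaluates the objective and the constraint gaps for a candidate channel $\Pr(Y'|Y)$ supplied by hand. The dictionary encoding of channels and the names `apply_channel` and `evaluate_channel` are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product
from math import log2

def marginal(joint, idx):
    """Marginalize a joint distribution {outcome_tuple: prob} onto positions idx."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return dict(out)

def mutual_information(joint, a_idx, b_idx):
    """I(A:B) in bits, where A and B are given as tuples of positions."""
    pa, pb = marginal(joint, a_idx), marginal(joint, b_idx)
    pab = marginal(joint, a_idx + b_idx)
    return sum(
        p * log2(p / (pa[k[:len(a_idx)]] * pb[k[len(a_idx):]]))
        for k, p in pab.items() if p > 0
    )

def apply_channel(joint, n, channel):
    """Joint over (x_1..x_n, y') implied by the Markov chain X -> Y -> Y'."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        for y_new, q in channel[outcome[n]].items():
            if q > 0:
                out[outcome[:n] + (y_new,)] += p * q
    return dict(out)

def evaluate_channel(joint, n, channel):
    """Return (objective I(X_1..n:Y'), per-predictor gaps I(X_i:Y') - I(X_i:Y))."""
    new = apply_channel(joint, n, channel)
    objective = mutual_information(new, tuple(range(n)), (n,))
    gaps = [
        mutual_information(new, (i,), (n,)) - mutual_information(joint, (i,), (n,))
        for i in range(n)
    ]
    return objective, gaps

xor_joint = {(x1, x2, x1 ^ x2): 0.25 for x1, x2 in product((0, 1), repeat=2)}
constant = {0: {"c": 1.0}, 1: {"c": 1.0}}   # Y' ignores Y entirely
identity = {0: {0: 1.0}, 1: {1: 1.0}}       # Y' = Y
print(evaluate_channel(xor_joint, 2, constant))
print(evaluate_channel(xor_joint, 2, identity))
```

For Xor the constant channel is feasible, because both singleton terms are already zero, and it drives the objective to zero. Since mutual information is nonnegative, that is the minimum, so the union information is 0 bits and eq. (15) gives 1 − 0 = 1 bit of synergy. The identity channel is always feasible but need not be minimal.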

Once the union information is computed, we define the synergistic mutual information among 
the n predictors as, 

$\mathcal{S}(\{X_1, \ldots, X_n\}:Y) = I(X_{1\ldots n}:Y) - I_\cup(\{X_1, \ldots, X_n\}:Y)$. (15)

Synergistic mutual information quantifies the total "informational work" only coalitions
perform in reducing the uncertainty of Y. Pleasingly, synergistic mutual information
is bounded⁹ by the WholeMinusSum synergy (which underestimates the intuitive synergy)
and $\mathcal{S}_{\max}$ (which overestimates intuitive synergy),

$\max[0, \mathrm{WMS}(\mathbf{X}:Y)] \leq \mathcal{S}(\mathbf{X}:Y) \leq \mathcal{S}_{\max}(\mathbf{X}:Y) \leq I(X_{1\ldots n}:Y)$. (16)

Synergistic mutual information can be made state-dependent for a particular state $y \in Y$,
i.e. $\mathcal{S}(\mathbf{X}:Y=y)$; Appendix C has the details.

If there is no redundant information among the parts, which is guaranteed by the simple
condition $\sum_{i=1}^{n} I(X_i : X_{1\ldots n \setminus i}) = 0$,¹⁰ the synergistic mutual information is equal to
WholeMinusSum, $\mathcal{S}(\mathbf{X}:Y) = \mathrm{WMS}(\mathbf{X}:Y)$.

Conditional dependence among predictors $\mathbf{X}$, $\Pr(X_{1\ldots n}|Y) \neq \prod_{i=1}^{n} \Pr(X_i|Y)$, is necessary
but not sufficient for set $\mathbf{X}$ to have synergistic information about Y.
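As a quick sanity check of the chain of bounds in eq. (16), the values reported in this paper for example And (Figure 9 and Table 1) can be substituted directly. Because the two predictors in And are independent, the no-redundancy condition holds and the lower bound is tight.

```latex
\begin{align*}
\max[0,\ \mathrm{WMS}(\mathbf{X}:Y)] &= \max[0,\ 0.189] = 0.189 \text{ bits},\\
\mathcal{S}(\mathbf{X}:Y) &= 0.189 \text{ bits} \quad (\text{equal to WMS, since } I(X_1:X_2) = 0),\\
\mathcal{S}_{\max}(\mathbf{X}:Y) &= 0.5 \text{ bits},\\
I(X_1 X_2:Y) &= 0.811 \text{ bits},\\
0.189 \;\le\; 0.189 &\;\le\; 0.5 \;\le\; 0.811 .
\end{align*}
```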

6 Applying the measures to our examples 

Table 1 summarizes the results of all four measures applied to our examples. 

Rdn (Figure 2). There is exactly one bit of redundant information and all measures reach 
their intended answer. 

Unq (Figure 3). <S m ax's miscategorization of unique information as synergistic information 
reveals itself. Intuitively, there are two bits of unique information and no synergy. However, 
£max reports one bit of synergistic information. 

Xor (Figure 4). There is one bit of synergistic information and nothing more. All measures 
reach the expected answer of 1 bit. 

XorDuplicate (Figure 5). Target Y is specified by the coalition $X_1X_2$ as well as by the
coalition $X_3X_2$, thus $I(X_1X_2:Y) = I(X_3X_2:Y) = H(Y) = 1$ bit. All measures reach the
expected answer of 1 bit.

XorLoses (Figure 6). Target Y is fully specified by the coalition $X_1X_2$ as well as by the
singleton $X_3$, thus $I(X_1X_2:Y) = I(X_3:Y) = H(Y) = 1$ bit. Together this means there is
one bit of redundancy between the coalition $X_1X_2$ and the singleton $X_3$, as denoted by the
+1 in Pi-region {3, 12}. All measures account for this redundancy and reach the expected
answer of 0 bits.

RdnXor (Figure 8). This example has one bit of synergy as well as one bit of redundancy.
In accordance with Figure 7a, WholeMinusSum measures synergy minus redundancy to
calculate 1 - 1 = 0 bits. On the other hand, $\mathcal{S}_{\max}$, $\Delta I$, and $\mathcal{S}$ are not misled by the
coexistence of synergy and redundancy and correctly report 1 bit of synergistic information.

And (Figure 9). This example is a simple case where correlational importance, $\Delta I(\mathbf{X};Y)$,
disagrees with the intuitive value for synergy. The WholeMinusSum synergy (the lower bound
on the intuitive synergy) is 0.189 bits, yet $\Delta I(\mathbf{X};Y) = 0.104$ bits. Furthermore, just as in
example Unq, $\mathcal{S}_{\max}$ again miscategorizes the second unique information as synergistic and
overestimates the synergy, arriving at 0.189 + 0.311 = 0.5 bits.
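The numbers quoted for And can be reproduced with a short calculation. The sketch below assumes two independent fair input bits feeding an And gate, and uses the definition of $\mathcal{S}_{\max}$ from Appendix D.2.1 (whole information minus the expected maximum specific-surprise). All function names are illustrative choices, not the authors' code.

```python
from collections import defaultdict
from math import log2

def marginal(joint, idx):
    """Marginalize a pmf {(x_1, ..., x_n, y): prob} onto the positions in idx."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return out

def entropy(pmf):
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

def mutual_info(joint, a, b):
    return (entropy(marginal(joint, a)) + entropy(marginal(joint, b))
            - entropy(marginal(joint, a + b)))

def specific_surprise(joint, i, n, y_val):
    """I(X_i : Y=y) = D_KL[Pr(X_i|y) || Pr(X_i)] in bits."""
    p_xi = marginal(joint, [i])
    p_xi_y = marginal(joint, [i, n])
    p_y = marginal(joint, [n])[(y_val,)]
    return sum((p / p_y) * log2((p / p_y) / p_xi[(x,)])
               for (x, y), p in p_xi_y.items() if y == y_val and p > 0)

def i_max(joint, n):
    """E_y max_i I(X_i : Y=y)."""
    return sum(p * max(specific_surprise(joint, i, n, y) for i in range(n))
               for (y,), p in marginal(joint, [n]).items())

# Assumed encoding of And as (x1, x2, y) with uniform inputs.
and_gate = {(a, b, a & b): 0.25 for a in (0, 1) for b in (0, 1)}
whole = mutual_info(and_gate, [0, 1], [2])
wms = whole - sum(mutual_info(and_gate, [i], [2]) for i in (0, 1))
s_max = whole - i_max(and_gate, 2)
print(round(wms, 3), round(i_max(and_gate, 2), 3), round(s_max, 3))  # 0.189 0.311 0.5
```

The gap between the two values, 0.311 bits, is what $\mathcal{S}_{\max}$ adds on top of the WholeMinusSum lower bound in this example.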

9 Proven in Appendix D.2. 

10 This condition is much looser than full mutual independence among predictors. 
Example         S_max    WMS       ΔI       S

Rdn             0        -1*       0        0
Unq             1*       0         0        0
Xor             1        1         1        1

XorDuplicate    1        1         1        1
XorLoses        0        0         0        0

RdnXor          1        0*        1        1
And             1/2*     0.189     0.104*   0.189
AndDuplicate    1/2*     -0.123*   0.038*   0.189

XorMultiCoal    1        1         1        1
RdnUnqXor       2*       0*        1        1
ParityRdnRdn    1        -3*       1        1

Table 1: Synergy measures for our examples. Answers conflicting with the intuitive synergistic
information are marked with an asterisk. All examples exploit special cases to analytically compute $\mathcal{S}$. Both
analytically and numerically our measure $\mathcal{S}$ reaches the intuitive answer for every example.

AndDuplicate (Figure 10). This example shows how the measures respond to duplicating
a predictor in example And. As first demonstrated in example XorDuplicate, intuitive
synergistic information is unchanged when duplicating a predictor. However, both
WholeMinusSum and $\Delta I$ conflict with this intuition and decrease from And to AndDuplicate. In
contrast, measures $\mathcal{S}_{\max}$ and $\mathcal{S}$ are invariant when duplicating predictors.
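The sensitivity of WholeMinusSum to duplication is easy to check by hand or in code. The sketch below appends a copy of the first predictor to an assumed encoding of And and recomputes the whole information and WMS. The helper `duplicate_predictor` and the other names are hypothetical and exist only for this illustration.

```python
from collections import defaultdict
from math import log2

def entropy_of(joint, idx):
    """Entropy (bits) of the marginal of pmf {outcome_tuple: prob} on positions idx."""
    marg = defaultdict(float)
    for outcome, p in joint.items():
        marg[tuple(outcome[i] for i in idx)] += p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def mi(joint, a, b):
    """I(A:B) in bits for lists of positions a and b."""
    return entropy_of(joint, a) + entropy_of(joint, b) - entropy_of(joint, a + b)

def wms(joint, n):
    """WholeMinusSum synergy for n predictors followed by the target."""
    return mi(joint, list(range(n)), [n]) - sum(mi(joint, [i], [n]) for i in range(n))

def duplicate_predictor(joint, n, k):
    """Insert a copy of predictor k as a new last predictor (just before Y)."""
    return {o[:n] + (o[k],) + o[n:]: p for o, p in joint.items()}

and_gate = {(a, b, a & b): 0.25 for a in (0, 1) for b in (0, 1)}
and_dup = duplicate_predictor(and_gate, 2, 0)   # outcomes are (x1, x2, x1, y)

print(round(mi(and_gate, [0, 1], [2]), 3), round(wms(and_gate, 2), 3))
print(round(mi(and_dup, [0, 1, 2], [3]), 3), round(wms(and_dup, 3), 3))
```

The whole information is unchanged by the copy, while WMS subtracts the duplicated predictor's information a second time, moving from 0.189 to -0.123 bits as in Table 1.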

The three final examples XorMultiCoal, RdnUnqXor, and ParityRdnRdn aren't 
essential for understanding this paper and are discussed in Appendix A. 

7 Discussion 

Fundamentally, we assert that synergy quantifies how much a whole exceeds the union
of its parts. Considering synergy as the whole minus the sum of its parts inadvertently 
"double-subtracts" redundancies, thus underestimating synergy. Within information theory, 
Pi-diagrams, a generalization of Venn diagrams introduced in [6], are immensely helpful in 
improving one's intuition for synergy. 

Table 1 shows that no prior measure quantifies the intuitive notion of synergistic information 
in all cases. In fact, no prior measure consistently matches intuition even for n = 2. To 
summarize, 

1. $I_{\max}$ synergy, $\mathcal{S}_{\max}$, overestimates the intuitive synergy when two or more predictors
have unique information about the target (e.g. Unq).

2. WholeMinusSum synergy, WMS, inadvertently double-subtracts redundancies and
thus underestimates the intuitive synergy (e.g. RdnXor). Duplicating predictors
turns unique information into redundant information, thereby decreasing
WholeMinusSum synergy.

3. Correlational importance, $\Delta I$, is not bounded by the Shannon mutual information.
Duplicating predictors often decreases correlational importance (e.g. AndDuplicate).
Altogether, $\Delta I$ does not quantify the intuitive synergistic information (nor
was it intended to).

We demonstrate by examples (e.g. RdnXor and RdnUnqXor in Appendix A) that a single 
state can simultaneously carry redundant, unique, and synergistic information. This fact is
underappreciated in the current literature. Prior work often implicitly assumed that these 
three types of information cannot coexist in a single state. 

We introduce an implicit analytical expression for synergistic mutual information (eq. (15)).
Unfortunately our expression is not easily computable, and until we have an explicit analytic
derivation of the union information the best one can do is compute synergistic mutual
information via numerical optimization techniques. Along with our examples, we consider
our introduction of necessary and sufficient criteria for the union information (eq. (14))
our primary contribution to the literature.

We believe that our measure of synergy, synergistic mutual information, will be important in 
untangling informational relationships among the heavily interconnected molecular, genomic 
and neuronal networks found in evolved biological systems characterized by a high degree of 
robustness and redundancy. 
Appendix



A Three extra examples 

For the reader's intellectual pleasure, we include three more sophisticated examples:
XorMultiCoal, RdnUnqXor, and ParityRdnRdn. Example RdnUnqXor extends example
RdnXor to demonstrate redundant, unique, and synergistic information for every state
$y \in Y$. Example ParityRdnRdn illustrates how, for n > 2, WholeMinusSum synergy
subtracts redundancies multiple times.



X1    X2    X3    Y    Pr
ab    ac    bc    0    1/8
AB    Ac    Bc    0    1/8
Ab    AC    bC    0    1/8
aB    aC    BC    0    1/8
Ab    Ac    bc    1    1/8
aB    ac    Bc    1    1/8
ab    aC    bC    1    1/8
AB    AC    BC    1    1/8

(a) $\Pr(x_1, x_2, x_3, y)$

[Figure panels: (b) circuit diagram, (c) Pi-diagram]

Figure 11: Example XorMultiCoal demonstrates how the same information can be specified
by multiple coalitions. In XorMultiCoal the target Y has one bit of uncertainty, $H(Y) = 1$
bit, and Y is the parity of three incoming wires. Just as the output of Xor is specified only
after knowing the state of both inputs, the output of XorMultiCoal is specified only after
knowing the state of all three wires. Each predictor is distinct and has access to two of the
three incoming wires. For example, predictor $X_1$ has access to the a/A and b/B wires, $X_2$
has access to the a/A and c/C wires, and $X_3$ has access to the b/B and c/C wires. Although
no single predictor specifies Y, any coalition of two predictors has access to all three wires
and fully specifies Y, $I(X_1X_2:Y) = I(X_1X_3:Y) = I(X_2X_3:Y) = H(Y) = 1$ bit. In
the Pi-diagram this puts one bit in Pi-region {12, 13, 23} and zero everywhere else. The
amount of synergistic information is the same as Xor, and all measures reach the expected
answer of 1 bit.
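The coalition structure described in the caption can be checked directly. The following sketch builds XorMultiCoal from three independent fair wires, with 0/1 standing in for lower/upper case, and computes the information each singleton and each pair carries about Y. The encoding and helper names are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations, product
from math import log2

def entropy_of(joint, idx):
    """Entropy (bits) of the marginal of pmf {outcome_tuple: prob} on positions idx."""
    marg = defaultdict(float)
    for outcome, p in joint.items():
        marg[tuple(outcome[i] for i in idx)] += p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def mi(joint, a, b):
    """I(A:B) in bits for lists of positions a and b."""
    return entropy_of(joint, a) + entropy_of(joint, b) - entropy_of(joint, a + b)

# Wires a, b, c are independent fair bits; each predictor sees two of them
# and Y is the parity of all three. Outcomes are (x1, x2, x3, y).
xor_multi_coal = {((a, b), (a, c), (b, c), a ^ b ^ c): 1 / 8
                  for a, b, c in product((0, 1), repeat=3)}

singles = [mi(xor_multi_coal, [i], [3]) for i in range(3)]
pairs = [mi(xor_multi_coal, list(pair), [3]) for pair in combinations(range(3), 2)]
whole = mi(xor_multi_coal, [0, 1, 2], [3])
wms = whole - sum(singles)
print(singles, pairs, whole, wms)
```

Every singleton carries zero bits and every pair carries the full bit, so WholeMinusSum equals the whole information here.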
[Figure panels: (a) $\Pr(x_1, x_2, y)$, (b) circuit diagram, (c) Pi-diagram]

Figure 12: Example RdnUnqXor weaves examples Rdn, Unq, and Xor into one.
$I(X_1X_2:Y) = H(Y) = 4$ bits. This example is pleasing because it puts exactly
one bit in every Pi-region.
[Figure panels: (a) $\Pr(x_1, x_2, x_3, y)$, (b) circuit diagram, (c) Pi-diagram]

Figure 13: Example ParityRdnRdn has three predictors. The target Y has three bits
of uncertainty, $H(Y) = 3$. Examining any singleton predictor specifies the letters in Y
(ab/aB/Ab/AB), $I(X_i:Y) = 2 \ \forall i$, making two bits of redundant information. Y's third and
final bit (digit 0/1) is the parity of the digits of the three predictors and accordingly is
specified only by the triplet coalition $X_1X_2X_3$, making one bit of synergy. This example has
two bits of maximum redundancy and one bit of synergy. $I(X_1X_2X_3:Y) = H(Y) = 3$
bits.
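To see how WholeMinusSum subtracts the two redundant bits three times, one can build a distribution matching the caption. The construction below is an assumption (the original probability table did not survive): every predictor carries the same two letter bits plus its own independent digit, and Y carries the letters plus the parity of the three digits.

```python
from collections import defaultdict
from itertools import product
from math import log2

def entropy_of(joint, idx):
    """Entropy (bits) of the marginal of pmf {outcome_tuple: prob} on positions idx."""
    marg = defaultdict(float)
    for outcome, p in joint.items():
        marg[tuple(outcome[i] for i in idx)] += p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def mi(joint, a, b):
    """I(A:B) in bits for lists of positions a and b."""
    return entropy_of(joint, a) + entropy_of(joint, b) - entropy_of(joint, a + b)

# Assumed construction: shared letters (l1, l2), private digits d1, d2, d3.
parity_rdn_rdn = {}
for l1, l2, d1, d2, d3 in product((0, 1), repeat=5):
    letters = (l1, l2)
    outcome = ((letters, d1), (letters, d2), (letters, d3), (letters, d1 ^ d2 ^ d3))
    parity_rdn_rdn[outcome] = 1 / 32

singles = [mi(parity_rdn_rdn, [i], [3]) for i in range(3)]
whole = mi(parity_rdn_rdn, [0, 1, 2], [3])
wms = whole - sum(singles)
print(singles, whole, wms)
```

Under this construction each singleton contributes the same 2 redundant bits, the whole contributes 3 bits, and WMS comes out at 3 - 6 = -3 bits, the value listed in Table 1.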
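B Algebraic simplification of $\Delta I$

Prior literature [5,19-21] defines $\Delta I(\mathbf{X};Y)$ as,

\begin{align}
\Delta I(\mathbf{X};Y) &\equiv D_{KL}\left[ \Pr(Y|X_{1\ldots n}) \,\big\|\, \mathrm{Pr}_{\mathrm{ind}}(Y|\mathbf{X}) \right] \tag{17} \\
&= \mathbb{E}_{\mathbf{x}}\, D_{KL}\left[ \Pr(Y|\mathbf{x}) \,\big\|\, \mathrm{Pr}_{\mathrm{ind}}(Y|\mathbf{x}) \right] \tag{18} \\
&= \sum_{\mathbf{x},y \in \mathbf{X},Y} \Pr(\mathbf{x},y) \log \frac{\Pr(y|\mathbf{x})}{\mathrm{Pr}_{\mathrm{ind}}(y|\mathbf{x})} , \tag{19}
\end{align}

Where,

\begin{align}
\mathrm{Pr}_{\mathrm{ind}}(Y=y|\mathbf{X}=\mathbf{x}) &\equiv \frac{\Pr(y)\, \mathrm{Pr}_{\mathrm{ind}}(\mathbf{X}=\mathbf{x}|Y=y)}{\mathrm{Pr}_{\mathrm{ind}}(\mathbf{X}=\mathbf{x})} \tag{20} \\
&= \frac{\Pr(y) \prod_{i=1}^{n} \Pr(x_i|y)}{\mathrm{Pr}_{\mathrm{ind}}(\mathbf{X}=\mathbf{x})} \tag{21} \\
\mathrm{Pr}_{\mathrm{ind}}(\mathbf{X}=\mathbf{x}) &\equiv \sum_{y \in Y} \Pr(Y=y)\, \mathrm{Pr}_{\mathrm{ind}}(\mathbf{X}=\mathbf{x}|Y=y) \tag{22} \\
&= \sum_{y \in Y} \Pr(Y=y) \prod_{i=1}^{n} \Pr(x_i|y) . \tag{23}
\end{align}

The definition of $\Delta I$ (eq. (17)) reduces to,

\begin{align}
\Delta I(\mathbf{X};Y) &= \sum_{\mathbf{x},y \in \mathbf{X},Y} \Pr(\mathbf{x},y) \log \frac{\Pr(y|\mathbf{x})}{\mathrm{Pr}_{\mathrm{ind}}(y|\mathbf{x})} \tag{24} \\
&= \sum_{\mathbf{x},y \in \mathbf{X},Y} \Pr(\mathbf{x},y) \log \frac{\Pr(y|\mathbf{x})\, \mathrm{Pr}_{\mathrm{ind}}(\mathbf{x})}{\Pr(y) \prod_{i=1}^{n} \Pr(x_i|y)} \tag{25} \\
&= \sum_{\mathbf{x},y \in \mathbf{X},Y} \Pr(\mathbf{x},y) \log \frac{\Pr(\mathbf{x}|y)\, \mathrm{Pr}_{\mathrm{ind}}(\mathbf{x})}{\Pr(\mathbf{x}) \prod_{i=1}^{n} \Pr(x_i|y)} \tag{26} \\
&= \sum_{\mathbf{x},y \in \mathbf{X},Y} \Pr(\mathbf{x},y) \log \frac{\Pr(\mathbf{x}|y)}{\prod_{i=1}^{n} \Pr(x_i|y)} + \sum_{\mathbf{x},y \in \mathbf{X},Y} \Pr(\mathbf{x},y) \log \frac{\mathrm{Pr}_{\mathrm{ind}}(\mathbf{x})}{\Pr(\mathbf{x})} \tag{27} \\
&= \sum_{y \in Y} \Pr(y) \sum_{\mathbf{x} \in \mathbf{X}} \Pr(\mathbf{x}|y) \log \frac{\Pr(\mathbf{x}|y)}{\prod_{i=1}^{n} \Pr(x_i|y)} - \sum_{\mathbf{x} \in \mathbf{X}} \Pr(\mathbf{x}) \log \frac{\Pr(\mathbf{x})}{\mathrm{Pr}_{\mathrm{ind}}(\mathbf{x})} \tag{28} \\
&= \mathbb{E}_{y}\, D_{KL}\left[ \Pr(X_{1\ldots n}|y) \,\Big\|\, \prod_{i=1}^{n} \Pr(X_i|y) \right] - \sum_{\mathbf{x} \in \mathbf{X}} \Pr(\mathbf{x}) \log \frac{\Pr(\mathbf{x})}{\mathrm{Pr}_{\mathrm{ind}}(\mathbf{x})} \tag{29} \\
&= \mathbb{E}_{y}\, D_{KL}\left[ \Pr(X_{1\ldots n}|y) \,\Big\|\, \prod_{i=1}^{n} \Pr(X_i|y) \right] - D_{KL}\left[ \Pr(X_{1\ldots n}) \,\big\|\, \mathrm{Pr}_{\mathrm{ind}}(\mathbf{X}) \right] \tag{30} \\
&= TC(X_1; \cdots; X_n|Y) - D_{KL}\left[ \Pr(X_{1\ldots n}) \,\big\|\, \mathrm{Pr}_{\mathrm{ind}}(\mathbf{X}) \right] \tag{31} \\
&= TC(X_1; \cdots; X_n|Y) - D_{KL}\left[ \Pr(X_{1\ldots n}) \,\Big\|\, \sum_{y \in Y} \Pr(y) \prod_{i=1}^{n} \Pr(X_i|y) \right] , \tag{32}
\end{align}

where $TC(X_1; \cdots; X_n|Y)$ is the conditional total correlation among the predictors given Y.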
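The simplification can be sanity-checked numerically by evaluating $\Delta I$ both from its definition and from the final form, conditional total correlation minus $D_{KL}[\Pr(X_{1\ldots n}) \| \mathrm{Pr}_{\mathrm{ind}}(\mathbf{X})]$. The sketch below is a hedged illustration: the function name `delta_i_two_ways` and the And encoding are assumptions, and the code is not taken from the cited literature.

```python
from collections import defaultdict
from math import log2, prod

def delta_i_two_ways(joint, n):
    """Return (Delta I from the definition, TC(X_1;..;X_n|Y) - D_KL[Pr(X) || Pr_ind(X)])."""
    p_y, p_x = defaultdict(float), defaultdict(float)
    p_xi_y = [defaultdict(float) for _ in range(n)]
    for outcome, p in joint.items():
        x, y = outcome[:n], outcome[n]
        p_y[y] += p
        p_x[x] += p
        for i in range(n):
            p_xi_y[i][(x[i], y)] += p

    def prod_cond(x, y):
        """prod_i Pr(x_i | y)."""
        return prod(p_xi_y[i].get((x[i], y), 0.0) / p_y[y] for i in range(n))

    def p_ind(x):
        """Pr_ind(x) = sum_y Pr(y) prod_i Pr(x_i | y)."""
        return sum(py * prod_cond(x, y) for y, py in p_y.items())

    direct = cond_tc = 0.0
    for outcome, p in joint.items():
        if p == 0:
            continue
        x, y = outcome[:n], outcome[n]
        p_ind_y_given_x = p_y[y] * prod_cond(x, y) / p_ind(x)
        direct += p * log2((p / p_x[x]) / p_ind_y_given_x)      # definition of Delta I
        cond_tc += p * log2((p / p_y[y]) / prod_cond(x, y))     # conditional total correlation
    kl = sum(px * log2(px / p_ind(x)) for x, px in p_x.items() if px > 0)
    return direct, cond_tc - kl

# Assumed encoding of And as (x1, x2, y) with uniform inputs.
and_gate = {(a, b, a & b): 0.25 for a in (0, 1) for b in (0, 1)}
direct, simplified = delta_i_two_ways(and_gate, 2)
print(round(direct, 3), round(simplified, 3))  # 0.104 0.104
```

Both routes give 0.104 bits for And, the value reported in Table 1.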
C State-dependent synergistic mutual information

We define synergistic mutual information, $\mathcal{S}$, for a particular state of the target $y \in Y$ as,

$$\mathcal{S}(\{X_1,\ldots,X_n\}:Y=y) = I(X_{1\ldots n}:Y=y) - I_\cup(\{X_1,\ldots,X_n\}:Y=y) , \tag{33}$$

where the state-dependent union information ($I_\cup$) is,

$$I_\cup(\{X_1,\ldots,X_n\}:Y=y) = \min_{\Pr(y'|Y)} I(X_{1\ldots n}:Y'=y') \tag{34}$$

subject to $X_{1\ldots n} \to Y \to Y'$ and $I(X_i:Y'=y') = I(X_i:Y=y) \ \forall i$.

The state-dependent mutual information, e.g. $I(X_i:Y=y)$, is [13]'s "specific-surprise",

\begin{align}
I(X_i:Y=y) &= D_{KL}\left[ \Pr(X_i|y) \,\big\|\, \Pr(X_i) \right] \tag{35} \\
&= \sum_{x_i \in X_i} \Pr(x_i|y) \log \frac{\Pr(x_i,y)}{\Pr(x_i)\Pr(y)} . \tag{36}
\end{align}
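As a small worked instance of eq. (36), the sketch below evaluates the specific-surprise for each state of Y and confirms that its average over states recovers the ordinary mutual information. The example distribution (one input of an And gate paired with the gate's output) and the function name are assumptions for illustration.

```python
from collections import defaultdict
from math import log2

def specific_surprise(joint_xy, y_val):
    """I(X:Y=y) = sum_x Pr(x|y) log2[Pr(x,y) / (Pr(x) Pr(y))] for a pmf {(x, y): prob}."""
    p_x = defaultdict(float)
    p_y = 0.0
    for (x, y), p in joint_xy.items():
        p_x[x] += p
        if y == y_val:
            p_y += p
    return sum((p / p_y) * log2(p / (p_x[x] * p_y))
               for (x, y), p in joint_xy.items() if y == y_val and p > 0)

# Assumed example: X1 is a fair bit and Y = X1 and X2 for an independent fair bit X2.
x1_and_y = {(0, 0): 0.5, (1, 0): 0.25, (1, 1): 0.25}
per_state = {y: specific_surprise(x1_and_y, y) for y in (0, 1)}
average = 0.75 * per_state[0] + 0.25 * per_state[1]   # weights are Pr(Y=0), Pr(Y=1)
print(per_state, average)
```

Here the rare state Y = 1 carries a full bit about X1 while Y = 0 carries much less, and the weighted average is $I(X_1:Y) \approx 0.311$ bits.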

D Essential proofs 

These proofs underpin our essential claims about our introduced measure, synergistic mutual 
information. 

D.1 Proof of equivalence to the Maurer method for n = 2

First, an initial proof that,

$$I(X_1:X_2|Y) = I(X_1X_2:Y) + I(X_1:X_2) - I(X_1:Y) - I(X_2:Y) .$$

Proof.

\begin{align}
I(X_1X_2:Y) - I(X_1:Y) - I(X_2:Y) &= H(X_1X_2) - H(X_1X_2|Y) - H(X_1) + H(X_1|Y) - H(X_2) + H(X_2|Y) \notag \\
&= I(X_1:X_2|Y) + H(X_1X_2) - H(X_1) - H(X_2) \tag{37} \\
&= I(X_1:X_2|Y) - I(X_1:X_2) \tag{38} \\
I(X_1:X_2|Y) &= I(X_1X_2:Y) + I(X_1:X_2) - I(X_1:Y) - I(X_2:Y) . \tag{39}
\end{align}

□
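Eq. (39) is a generic identity, so it can be spot-checked on an arbitrary distribution. The sketch below draws a random joint pmf over three binary variables and compares both sides; the seed, variable names and helper functions are arbitrary choices made for this illustration.

```python
import random
from collections import defaultdict
from itertools import product
from math import log2

def H(joint, idx):
    """Entropy (bits) of the marginal of pmf {outcome_tuple: prob} on positions idx."""
    marg = defaultdict(float)
    for outcome, p in joint.items():
        marg[tuple(outcome[i] for i in idx)] += p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

random.seed(0)
weights = {o: random.random() for o in product((0, 1), repeat=3)}  # outcomes (x1, x2, y)
total = sum(weights.values())
joint = {o: w / total for o, w in weights.items()}

X1, X2, Y = [0], [1], [2]

def I(a, b):
    """I(A:B) in bits."""
    return H(joint, a) + H(joint, b) - H(joint, a + b)

# Left side: I(X1:X2|Y) = H(X1,Y) + H(X2,Y) - H(X1,X2,Y) - H(Y)
cond_mi = H(joint, X1 + Y) + H(joint, X2 + Y) - H(joint, X1 + X2 + Y) - H(joint, Y)
# Right side of eq. (39)
rhs = I(X1 + X2, Y) + I(X1, X2) - I(X1, Y) - I(X2, Y)
print(cond_mi, rhs)
```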



Now we prove that for n = 2 the Maurer method for computing synergy is equivalent to our
method for computing synergy. We show that,

\begin{align}
I(X_1:X_2|Y) - \min_{X_1X_2 \to Y \to Y'} I(X_1:X_2|Y') &= \mathcal{S}(\{X_1,X_2\}:Y) \tag{40} \\
&= I(X_1X_2:Y) - \min_{\substack{X_1X_2 \to Y \to Y' \\ I(X_1:Y')=I(X_1:Y) \\ I(X_2:Y')=I(X_2:Y)}} I(X_1X_2:Y') . \notag
\end{align}
Proof.

\begin{align}
& I(X_1:X_2|Y) - \min_{X_1X_2 \to Y \to Y'} I(X_1:X_2|Y') \notag \\
&= I(X_1:X_2|Y) - \min_{X_1X_2 \to Y \to Y'} \underbrace{\left[ I(X_1X_2:Y') + I(X_1:X_2) - I(X_1:Y') - I(X_2:Y') \right]}_{\text{expand per eq. (39)}} \tag{41} \\
&= \underbrace{I(X_1:X_2|Y) - I(X_1:X_2)}_{\text{expand per eq. (39)}} - \min_{X_1X_2 \to Y \to Y'} \left[ I(X_1X_2:Y') - I(X_1:Y') - I(X_2:Y') \right] \tag{42} \\
&= I(X_1X_2:Y) - I(X_1:Y) - I(X_2:Y) - \min_{X_1X_2 \to Y \to Y'} \underbrace{\left[ I(X_1X_2:Y') - I(X_1:Y') - I(X_2:Y') \right]}_{\text{decompose into Pi-regions}} . \tag{43}
\end{align}



We now decompose $I(X_1X_2:Y') - I(X_1:Y') - I(X_2:Y')$ into Pi-regions.

• $I(X_1X_2:Y')$ is composed of Pi-regions: {12}, {1}, {2}, and {1,2}.

• $I(X_1:Y')$ is composed of Pi-regions {1} and {1,2}.

• $I(X_2:Y')$ is composed of Pi-regions {2} and {1,2}.

Thus the difference $I(X_1X_2:Y') - I(X_1:Y') - I(X_2:Y')$ is Pi-regions {12} - {1,2}.

$$I(X_1:X_2|Y) - \min_{X_1X_2 \to Y \to Y'} I(X_1:X_2|Y') = I(X_1X_2:Y) - I(X_1:Y) - I(X_2:Y) - \min_{X_1X_2 \to Y \to Y'} \underbrace{\left[ I(X_1X_2:Y') - I(X_1:Y') - I(X_2:Y') \right]}_{\text{Pi-regions: } \{12\} - \{1,2\}} \tag{44}$$

As the minimum of eq. (44) is the synergy (Pi-region {12}) minus the redundancy (Pi-region
{1,2}), we can add any constraints we wish to the minimization $\min_{X_1X_2 \to Y \to Y'}$ that do
not increase the synergy or decrease the redundancy. We choose to add the constraints
$I(X_1:Y') = I(X_1:Y)$ and $I(X_2:Y') = I(X_2:Y)$. This gives us,

\begin{align}
& I(X_1:X_2|Y) - \min_{X_1X_2 \to Y \to Y'} I(X_1:X_2|Y') \notag \\
&= I(X_1X_2:Y) - I(X_1:Y) - I(X_2:Y) - \min_{\substack{X_1X_2 \to Y \to Y' \\ I(X_1:Y')=I(X_1:Y) \\ I(X_2:Y')=I(X_2:Y)}} \Big[ I(X_1X_2:Y') - \underbrace{I(X_1:Y')}_{=I(X_1:Y)} - \underbrace{I(X_2:Y')}_{=I(X_2:Y)} \Big] \notag \\
&= I(X_1X_2:Y) - I(X_1:Y) - I(X_2:Y) + I(X_1:Y) + I(X_2:Y) - \min_{\substack{X_1X_2 \to Y \to Y' \\ I(X_1:Y')=I(X_1:Y) \\ I(X_2:Y')=I(X_2:Y)}} I(X_1X_2:Y') \tag{45} \\
&= I(X_1X_2:Y) - \min_{\substack{X_1X_2 \to Y \to Y' \\ I(X_1:Y')=I(X_1:Y) \\ I(X_2:Y')=I(X_2:Y)}} I(X_1X_2:Y') \tag{46} \\
&= \mathcal{S}(\{X_1,X_2\}:Y) . \tag{47}
\end{align}

And the proof of eq. (40) is complete.

□
D.2 Proof of bounds of $\mathcal{S}(\mathbf{X}:Y)$

We show that,

$$\mathrm{WMS}(\mathbf{X}:Y) \le \mathcal{S}(\mathbf{X}:Y) \le \mathcal{S}_{\max}(\mathbf{X}:Y) \le I(X_{1\ldots n}:Y) .$$
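Because $\mathcal{S}$ itself requires the constrained minimization, it is not computed here; the outer parts of the chain, however, can be spot-checked numerically. The sketch below is an illustration under assumptions (random binary distributions, hypothetical helper names) that tests $\max[0, \mathrm{WMS}] \le \mathcal{S}_{\max} \le I(X_{1\ldots n}:Y)$, which the chain implies.

```python
import random
from collections import defaultdict
from itertools import product
from math import log2

def marginal(joint, idx):
    """Marginalize a pmf {(x_1, ..., x_n, y): prob} onto the positions in idx."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return out

def entropy(pmf):
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

def mutual_info(joint, a, b):
    return (entropy(marginal(joint, a)) + entropy(marginal(joint, b))
            - entropy(marginal(joint, a + b)))

def s_max(joint, n):
    """S_max = I(X_1..n:Y) - sum_y Pr(y) max_i D_KL[Pr(X_i|y) || Pr(X_i)]."""
    i_max = 0.0
    for (y,), py in marginal(joint, [n]).items():
        best = 0.0
        for i in range(n):
            p_xi, p_xi_y = marginal(joint, [i]), marginal(joint, [i, n])
            kl = sum((p / py) * log2((p / py) / p_xi[(x,)])
                     for (x, yy), p in p_xi_y.items() if yy == y and p > 0)
            best = max(best, kl)
        i_max += py * best
    return mutual_info(joint, list(range(n)), [n]) - i_max

def check_bounds(trials=200, n=2, seed=0, tol=1e-9):
    """Check max(0, WMS) <= S_max <= I(X:Y) on random pmfs over n+1 binary variables."""
    rng = random.Random(seed)
    for _ in range(trials):
        w = {o: rng.random() for o in product((0, 1), repeat=n + 1)}
        z = sum(w.values())
        joint = {o: v / z for o, v in w.items()}
        whole = mutual_info(joint, list(range(n)), [n])
        wms = whole - sum(mutual_info(joint, [i], [n]) for i in range(n))
        if not (max(0.0, wms) <= s_max(joint, n) + tol <= whole + 2 * tol):
            return False
    return True

print(check_bounds())
```

A passing check is only a consistency test on sampled distributions; it does not replace the proofs that follow.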

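D.2.1 Proof that $\mathcal{S}_{\max}(\mathbf{X}:Y) \le I(X_{1\ldots n}:Y)$

Proof.

\begin{align}
\mathcal{S}_{\max}(\mathbf{X}:Y) &\equiv I(X_{1\ldots n}:Y) - \sum_{y \in Y} \Pr(y) \max_i I(X_i:Y=y) \tag{48} \\
&= I(X_{1\ldots n}:Y) - \sum_{y \in Y} \Pr(y) \max_i \underbrace{D_{KL}\left[ \Pr(X_i|y) \,\big\|\, \Pr(X_i) \right]}_{\ge 0} \tag{49} \\
&= I(X_{1\ldots n}:Y) - \underbrace{\sum_{y \in Y} \Pr(y) \max_i D_{KL}\left[ \Pr(X_i|y) \,\big\|\, \Pr(X_i) \right]}_{\ge 0} \tag{50} \\
&\le I(X_{1\ldots n}:Y) . \tag{51}
\end{align}

□

D.2.2 Proof that $\mathcal{S}(\mathbf{X}:Y) \le \mathcal{S}_{\max}(\mathbf{X}:Y)$

We invoke the standard definitions of $\mathcal{S}$ and $\mathcal{S}_{\max}$,

\begin{align}
\mathcal{S}(\mathbf{X}:Y) &= I(X_{1\ldots n}:Y) - I_\cup(\mathbf{X}:Y) \tag{52} \\
\mathcal{S}_{\max}(\mathbf{X}:Y) &= I(X_{1\ldots n}:Y) - I_{\max}(\mathbf{X}:Y) , \tag{53}
\end{align}

where $I_\cup$ and $I_{\max}$ are defined as,

\begin{align}
I_\cup(\mathbf{X}:Y) &= \mathbb{E}_y\, I_\cup(\mathbf{X}:Y=y) \tag{54} \\
&= \mathbb{E}_y \min_{\substack{X_{1\ldots n} \to Y \to Y' \\ I(X_i:Y'=y') = I(X_i:Y=y) \ \forall i}} I(X_{1\ldots n}:Y'=y') \tag{55} \\
I_{\max}(\mathbf{X}:Y) &= \mathbb{E}_y \max_i I(X_i:Y=y) . \tag{56}
\end{align}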

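Now we prove $\mathcal{S}(\mathbf{X}:Y) \le \mathcal{S}_{\max}(\mathbf{X}:Y)$ by showing that $I_\cup(\mathbf{X}:Y) \ge I_{\max}(\mathbf{X}:Y)$.

Proof.

\begin{align}
\mathbb{E}_y\, I_\cup(\mathbf{X}:Y=y) &\ge \mathbb{E}_y\, I_{\max}(\mathbf{X}:Y=y) \tag{57} \\
\mathbb{E}_y \left[ I_\cup(\mathbf{X}:Y=y) - I_{\max}(\mathbf{X}:Y=y) \right] &\ge 0 . \tag{58}
\end{align}

Now expanding $I_\cup(\mathbf{X}:Y=y)$ and $I_{\max}(\mathbf{X}:Y=y)$,

$$\mathbb{E}_y \left( \min_{\substack{X_{1\ldots n} \to Y \to Y' \\ I(X_i:Y'=y') = I(X_i:Y=y) \ \forall i}} I(X_{1\ldots n}:Y'=y') - \max_i I(X_i:Y=y) \right) \ge 0 . \tag{59}$$

We define the index $m \in \{1, \ldots, n\}$ such that $m = \operatorname{argmax}_i I(X_i:Y=y)$. The predictor
with the most information about state $Y = y$ is thus $X_m$. This yields,

$$\mathbb{E}_y \left( \min_{\substack{X_{1\ldots n} \to Y \to Y' \\ I(X_i:Y'=y') = I(X_i:Y=y) \ \forall i}} I(X_{1\ldots n}:Y'=y') - I(X_m:Y=y) \right) \ge 0 . \tag{60}$$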
The constraint $I(X_i:Y'=y') = I(X_i:Y=y)$ entails that $I(X_m:Y=y) = I(X_m:Y'=y')$.
Therefore we can pull $I(X_m:Y=y)$ inside the minimization as a constant,

$$\mathbb{E}_y \min_{\substack{X_{1\ldots n} \to Y \to Y' \\ I(X_i:Y'=y') = I(X_i:Y=y) \ \forall i}} \left[ I(X_{1\ldots n}:Y'=y') - I(X_m:Y'=y') \right] \ge 0 . \tag{61}$$

As $X_m$ is a subset of the predictors $X_{1\ldots n}$, we can subtract it, yielding,

$$\mathbb{E}_y \min_{\substack{X_{1\ldots n} \to Y \to Y' \\ I(X_i:Y'=y') = I(X_i:Y=y) \ \forall i}} I(X_{1\ldots n \setminus m}:Y'=y' \,|\, X_m) \ge 0 . \tag{62}$$

The state-dependent conditional mutual information $I(X_{1\ldots n \setminus m}:Y'=y' \,|\, X_m)$ is a
Kullback-Leibler divergence. As such it is nonnegative. Likewise the minimum of a nonnegative
quantity is also nonnegative.

$$\mathbb{E}_y \underbrace{\min_{\substack{X_{1\ldots n} \to Y \to Y' \\ I(X_i:Y'=y') = I(X_i:Y=y) \ \forall i}} \underbrace{I(X_{1\ldots n \setminus m}:Y'=y' \,|\, X_m)}_{\ge 0}}_{\ge 0} \ge 0 . \tag{63}$$

Finally, the expected value of a list of nonnegative quantities is itself nonnegative. And the
proof that $\mathcal{S}(\mathbf{X}:Y) \le \mathcal{S}_{\max}(\mathbf{X}:Y)$ is complete.

□



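D.2.3 Proof that $\mathrm{WMS}(\mathbf{X}:Y) \le \mathcal{S}(\mathbf{X}:Y)$

We invoke the standard definitions of WMS and $\mathcal{S}$,

\begin{align}
\mathrm{WMS}(\mathbf{X}:Y) &= I(X_{1\ldots n}:Y) - \sum_{i=1}^{n} I(X_i:Y) \tag{64} \\
\mathcal{S}(\mathbf{X}:Y) &= I(X_{1\ldots n}:Y) - I_\cup(\mathbf{X}:Y) \tag{65} \\
&= I(X_{1\ldots n}:Y) - \min_{\substack{X_{1\ldots n} \to Y \to Y' \\ I(X_i:Y')=I(X_i:Y) \ \forall i}} I(X_{1\ldots n}:Y') . \tag{66}
\end{align}

We prove the conjecture $\mathrm{WMS}(\mathbf{X}:Y) \le \mathcal{S}(\mathbf{X}:Y)$ by showing,

$$\min_{\substack{X_{1\ldots n} \to Y \to Y' \\ I(X_i:Y')=I(X_i:Y) \ \forall i}} I(X_{1\ldots n}:Y') \le \sum_{i=1}^{n} I(X_i:Y) . \tag{67}$$

Proof. Given:

$$\min_{\substack{X_{1\ldots n} \to Y \to Y' \\ I(X_1:Y')=I(X_1:Y) \\ \vdots \\ I(X_n:Y')=I(X_n:Y)}} I(X_{1\ldots n}:Y') , \tag{68}$$

the individual constraint $I(X_1:Y') = I(X_1:Y)$ can add at most $I(X_1:Y)$ to $I(X_{1\ldots n}:Y')$.
Therefore we can upper-bound eq. (68) by dropping the constraint $I(X_1:Y') = I(X_1:Y)$ and
adding $I(X_1:Y)$. This yields,

$$\min_{\substack{X_{1\ldots n} \to Y \to Y' \\ I(X_1:Y')=I(X_1:Y) \\ \vdots \\ I(X_n:Y')=I(X_n:Y)}} I(X_{1\ldots n}:Y') \le \min_{\substack{X_{1\ldots n} \to Y \to Y' \\ I(X_2:Y')=I(X_2:Y) \\ \vdots \\ I(X_n:Y')=I(X_n:Y)}} I(X_{1\ldots n}:Y') + I(X_1:Y) . \tag{69}$$

Likewise, the right-hand side of eq. (69) can be upper-bounded by dropping the constraint
$I(X_2:Y') = I(X_2:Y)$ and adding $I(X_2:Y)$. This yields,

$$\min_{\substack{X_{1\ldots n} \to Y \to Y' \\ I(X_1:Y')=I(X_1:Y) \\ \vdots \\ I(X_n:Y')=I(X_n:Y)}} I(X_{1\ldots n}:Y') \le \min_{\substack{X_{1\ldots n} \to Y \to Y' \\ I(X_3:Y')=I(X_3:Y) \\ \vdots \\ I(X_n:Y')=I(X_n:Y)}} I(X_{1\ldots n}:Y') + I(X_1:Y) + I(X_2:Y) . \tag{70}$$

Repeating this process n times yields,

\begin{align}
\min_{\substack{X_{1\ldots n} \to Y \to Y' \\ I(X_i:Y')=I(X_i:Y) \ \forall i}} I(X_{1\ldots n}:Y') &\le \underbrace{\min_{X_{1\ldots n} \to Y \to Y'} I(X_{1\ldots n}:Y')}_{=0} + \sum_{i=1}^{n} I(X_i:Y) \tag{71} \\
&= \sum_{i=1}^{n} I(X_i:Y) . \tag{72}
\end{align}

And the proof is complete. □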
D.3 Proof that the union information is idempotent

We show that the synergistic mutual information $\mathcal{S}(\mathbf{X}:Y)$ is invariant under adding an
additional predictor $X_i \in \{X_1, \ldots, X_n\}$. We assume without loss of generality that the
duplicated predictor is $X_1$. We will show that,

$$\mathcal{S}(\{X_1,\ldots,X_n,X_1\}:Y) = \mathcal{S}(\{X_1,\ldots,X_n\}:Y) . \tag{73}$$

Proof. We start with the expression for $\mathcal{S}(\{X_1,\ldots,X_n,X_1\}:Y)$,

$$\mathcal{S}(\{X_1,\ldots,X_n,X_1\}:Y) = I(X_{1\ldots n}X_1:Y) - \min_{\substack{X_{1\ldots n}X_1 \to Y \to Y' \\ I(X_i:Y')=I(X_i:Y) \ \forall i \\ I(X_1:Y')=I(X_1:Y)}} I(X_{1\ldots n}X_1:Y') . \tag{74}$$

The two mutual information terms do not change when duplicating predictor $X_1$. This
yields,

$$\mathcal{S}(\{X_1,\ldots,X_n,X_1\}:Y) = I(X_{1\ldots n}:Y) - \min_{\substack{X_{1\ldots n}X_1 \to Y \to Y' \\ I(X_i:Y')=I(X_i:Y) \ \forall i \\ I(X_1:Y')=I(X_1:Y)}} I(X_{1\ldots n}:Y') . \tag{75}$$

Having the constraint $I(X_1:Y') = I(X_1:Y)$ twice is superfluous. Therefore we can remove
the latter one, yielding,

$$\mathcal{S}(\{X_1,\ldots,X_n,X_1\}:Y) = I(X_{1\ldots n}:Y) - \min_{\substack{X_{1\ldots n}X_1 \to Y \to Y' \\ I(X_i:Y')=I(X_i:Y) \ \forall i}} I(X_{1\ldots n}:Y') . \tag{76}$$

Finally, the Markov condition $X_{1\ldots n}X_1 \to Y \to Y'$ means that,

\begin{align}
\Pr(x_{1\ldots n}, x_1, y, y') &= \Pr(x_{1\ldots n}, x_1) \Pr(y|x_{1\ldots n}, x_1) \Pr(y'|y) \tag{77} \\
&= \Pr(x_{1\ldots n}, x_1) \Pr(y|x_{1\ldots n}) \Pr(y'|y) \tag{78} \\
&= \Pr(x_{1\ldots n}) \underbrace{\Pr(x_1|x_{1\ldots n})}_{=1} \Pr(y|x_{1\ldots n}) \Pr(y'|y) \tag{79} \\
&= \Pr(x_{1\ldots n}) \Pr(y|x_{1\ldots n}) \Pr(y'|y) , \tag{80}
\end{align}

which equates to the Markov condition $X_{1\ldots n} \to Y \to Y'$. Altogether, we end up with,

\begin{align}
\mathcal{S}(\{X_1,\ldots,X_n,X_1\}:Y) &= I(X_{1\ldots n}:Y) - \min_{\substack{X_{1\ldots n} \to Y \to Y' \\ I(X_i:Y')=I(X_i:Y) \ \forall i}} I(X_{1\ldots n}:Y') \tag{81} \\
&= \mathcal{S}(\{X_1,\ldots,X_n\}:Y) , \tag{82}
\end{align}

and the proof of eq. (73) is complete. □
E Nonessential proofs

These proofs demonstrate some interesting properties of synergistic mutual information;
however, our paper does not rely on them.

E.1 Proof of zero synergy when $Y = X_{1\ldots n}$

Objective: Prove that,

$$\mathcal{S}(\{X_1,\ldots,X_n\}:Y) = 0 \quad \text{when } Y = X_{1\ldots n} .$$

Proof.

\begin{align}
\mathcal{S}(\{X_1,\ldots,X_n\}:Y) &= I(X_{1\ldots n}:Y) - \min_{\substack{X_{1\ldots n} \to Y \to Y' \\ I(X_i:Y')=I(X_i:Y) \ \forall i}} I(X_{1\ldots n}:Y') \tag{83} \\
&= I(X_{1\ldots n}:X_{1\ldots n}) - \min_{\substack{X_{1\ldots n} \to Y \to Y' \\ I(X_i:Y')=I(X_i:Y) \ \forall i}} \left[ H(X_{1\ldots n}) - H(X_{1\ldots n}|Y') \right] \notag \\
&= H(X_{1\ldots n}) - H(X_{1\ldots n}) + \min_{\substack{X_{1\ldots n} \to Y \to Y' \\ I(X_i:Y')=I(X_i:Y) \ \forall i}} H(X_{1\ldots n}|Y') \tag{84} \\
&= \min_{\substack{X_{1\ldots n} \to Y \to Y' \\ I(X_i:Y')=I(X_i:Y) \ \forall i}} H(X_{1\ldots n}|Y') \tag{85}
\end{align}

Setting $Y' = Y = X_{1\ldots n}$ puts $H(X_{1\ldots n}|Y') = 0$
and satisfies all constraints.

$$= 0 . \tag{86}$$

□



